Compressing Trigram Language Models With Golomb Coding

Authors

  • Kenneth Church
  • Ted Hart
  • Jianfeng Gao
Abstract

Trigram language models are compressed using a Golomb coding method inspired by the original Unix spell program. Compression methods trade off space, time and accuracy (loss). The proposed HashTBO method optimizes space at the expense of time and accuracy. Trigram language models are normally considered memory hogs, but with HashTBO, it is possible to squeeze a trigram language model into a few megabytes or less. HashTBO made it possible to ship a trigram contextual speller in Microsoft Office 2007.
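The abstract does not spell out the encoding, so the following is only a minimal sketch of the general idea behind Golomb-coding a hashed set in the style of the Unix spell program: sort the hash values, take the gaps between neighbours, and write each gap as a unary quotient followed by a fixed-width remainder. It is not the authors' HashTBO implementation. The function names, the restriction to a power-of-two Golomb parameter M = 2**k (the Rice special case), and the use of a text bit string instead of packed bytes are all assumptions made for illustration.

```python
def golomb_encode(values, k):
    """Encode non-negative integers with Golomb parameter M = 2**k (Rice code).

    Returns a string of '0'/'1' characters (an illustrative stand-in for a
    packed bit array).
    """
    mask = (1 << k) - 1
    parts = []
    for v in values:
        q, r = v >> k, v & mask
        parts.append("1" * q + "0")            # quotient in unary, 0-terminated
        if k:
            parts.append(format(r, "0%db" % k))  # remainder in exactly k bits
    return "".join(parts)


def golomb_decode(bits, k, count):
    """Decode `count` integers produced by golomb_encode with the same k."""
    values, pos = [], 0
    for _ in range(count):
        q = 0
        while bits[pos] == "1":                # read the unary quotient
            q += 1
            pos += 1
        pos += 1                               # skip the terminating '0'
        r = int(bits[pos:pos + k], 2) if k else 0
        pos += k
        values.append((q << k) | r)
    return values


def compress_hashes(hashes, k):
    """Sort and deduplicate hash values, then Golomb-code the gaps between them."""
    gaps, prev = [], 0
    for h in sorted(set(hashes)):
        gaps.append(h - prev)
        prev = h
    return golomb_encode(gaps, k), len(gaps)


def decompress_hashes(bits, k, count):
    """Rebuild the sorted hash values by summing the decoded gaps."""
    out, total = [], 0
    for gap in golomb_decode(bits, k, count):
        total += gap
        out.append(total)
    return out


if __name__ == "__main__":
    hashes = [42, 3, 90, 17, 20]               # hypothetical hash values
    bits, n = compress_hashes(hashes, k=3)
    print(len(bits), "bits for", n, "values")
    print(decompress_hashes(bits, 3, n))
```

Small gaps cost few bits because the unary part stays short, which is why this kind of code suits sorted hash values that are roughly evenly spread. Choosing k close to the base-2 logarithm of the average gap is a common rule of thumb, not something the abstract specifies.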

Similar resources

N-Gram Language Model Compression Using Scalar Quantization and Incremental Coding

This paper describes a novel approach to compressing large trigram language models, which uses scalar quantization to compress log probabilities and back-off coefficients, and incremental coding to compress entry pointers. Experiments show that the new approach achieves roughly 2.5 times compression relative to the well-known tree-bucket format while keeping the perplexity and accessing ...

Generalized Golomb Codes and Adaptive Coding of Wavelet-Transformed Image Subbands

We describe a class of prefix-free codes for the nonnegative integers. We apply a family of codes in this class to the problem of runlength coding, specifically as part of an adaptive algorithm for compressing quantized subbands of wavelet-transformed images. On test images, our adaptive coding algorithm is shown to give compression effectiveness comparable to the best performance achievable by ...

Combining word prediction and r-ary Huffman coding for text entry

Two approaches to reducing effort in switch-based text entry for augmentative and alternative communication devices are word prediction and efficient coding schemes, such as Huffman. However, character distributions that inform the latter have never accounted for the use of the former. In this paper, we provide the first combination of Huffman codes and word prediction, using both trigram and l...

Compressing Integers for Fast File Access

Fast access to files of integers is crucial for the efficient resolution of queries to databases. Integers are the basis of indexes used to resolve queries, for example, in large internet search systems, and numeric data forms a large part of most databases. Disk access costs can be reduced by compression, if the cost of retrieving a compressed representation from disk and the CPU cost of decodi...

Efficient Data Compression Technique Using Modified Adaptive Rice Golomb Coding for Wireless Sensor Network

Wireless sensor networks (WSNs) are energy-constrained networks, since each node in a WSN is typically powered by batteries with limited capacity. Compressing the data sensed at each sensor node in an energy-efficient manner is necessary for extending the lifetime of a wireless sensor network. In each sensor node the communication module is the main energy-consuming unit and therefore data c...

Publication date: 2007